
[FLINK-39306][flink-autoscaler] Non-source vertices do not use per-second rate metrics, producing inaccurate scaling decisions#1078

Open
Dennis-Mircea wants to merge 1 commit into apache:main from Dennis-Mircea:FLINK-39306

Conversation

@Dennis-Mircea
Contributor

What is the purpose of the change

Fixes inaccurate scaling-metric computation for non-source vertices by collecting and consuming Flink's per-second rate metrics (numRecordsInPerSecond / numRecordsOutPerSecond) across the full autoscaler pipeline, and by replacing the endpoint-only getRate() with spike-resilient alternatives for gauge metrics such as LAG.

JIRA: https://issues.apache.org/jira/browse/FLINK-39306

Brief change log

  • FlinkMetric: Added NUM_RECORDS_IN_PER_SEC and NUM_RECORDS_OUT_PER_SEC enum entries to match Flink's numRecordsInPerSecond / numRecordsOutPerSecond task-level metrics.
  • ScalingMetric: Added NUM_RECORDS_IN_PER_SECOND and NUM_RECORDS_OUT_PER_SECOND scaling metric entries for storing per-second rates in the metrics history.
  • ScalingMetricCollector: Extended getFilteredVertexMetricNames to request NUM_RECORDS_IN_PER_SEC and NUM_RECORDS_OUT_PER_SEC for non-source vertices (previously only source-specific metrics were requested).
  • ScalingMetrics: Extended computeDataRateMetrics to store NUM_RECORDS_IN_PER_SECOND and NUM_RECORDS_OUT_PER_SECOND in the collected scaling metrics for non-source vertices when available.
  • ScalingMetricEvaluator:
    • Introduced getAverageWithRateFallback(perSecondMetric, accumulatedMetric, ...) - tries getAverage(perSecondMetric) first (direct per-second rate from Flink), falls back to getRate(accumulatedMetric) (endpoint-based delta from accumulated counters) when the per-second metric is unavailable.
    • Introduced getAverageRate(metric, ...) - computes the average of per-interval deltas across the full metrics window, replacing the spike-susceptible endpoint-only getRate() for gauge metrics like LAG.
    • Replaced all getRate(NUM_RECORDS_IN, ...) / getRate(NUM_RECORDS_OUT, ...) calls in isProcessingBacklog, evaluateMetrics, computeEdgeOutputRatio, and computeEdgeDataRate with getAverageWithRateFallback(NUM_RECORDS_IN_PER_SECOND, NUM_RECORDS_IN, ...) / getAverageWithRateFallback(NUM_RECORDS_OUT_PER_SECOND, NUM_RECORDS_OUT, ...).
    • Replaced getRate(LAG, ...) in computeTargetDataRate with getAverageRate(LAG, ...) to avoid diluted lag rate estimation when the metrics window is large.
  • JobAutoScalerImplTest: Fixed flaky testMetricReporting by injecting a deterministic Clock via autoscaler.setClock() to guarantee distinct timestamps across metric collections, avoiding the timestamp collision that caused the metrics history to stay at size 1.
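The difference between the replaced endpoint-only rate and the new windowed average can be sketched as follows. This is a minimal illustration only, not the actual ScalingMetricEvaluator code: the TreeMap-based history, the class name RateSketch, and both method names are assumptions for this example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;

public class RateSketch {

    // Endpoint-only rate: (last - first) / window length. Only the two
    // endpoint observations matter, so a spike in either one distorts the
    // whole estimate, and intermediate structure is ignored entirely.
    static double endpointRate(TreeMap<Instant, Double> history) {
        Map.Entry<Instant, Double> first = history.firstEntry();
        Map.Entry<Instant, Double> last = history.lastEntry();
        double seconds =
                Duration.between(first.getKey(), last.getKey()).toMillis() / 1000.0;
        return (last.getValue() - first.getValue()) / seconds;
    }

    // Average of per-interval per-second deltas across the full window.
    // Every observation contributes, and a single spiked sample only
    // affects its two adjacent intervals.
    static double averageRate(TreeMap<Instant, Double> history) {
        double sum = 0;
        int intervals = 0;
        Map.Entry<Instant, Double> prev = null;
        for (Map.Entry<Instant, Double> e : history.entrySet()) {
            if (prev != null) {
                double seconds =
                        Duration.between(prev.getKey(), e.getKey()).toMillis() / 1000.0;
                sum += (e.getValue() - prev.getValue()) / seconds;
                intervals++;
            }
            prev = e;
        }
        return sum / intervals;
    }
}
```

With unevenly spaced observations the two estimators diverge: for samples 0 → 100 → 200 taken at t = 0 s, 10 s, 40 s, the endpoint rate is 200/40 = 5.0/s while the interval average is (10.0 + 100.0/30)/2 ≈ 6.67/s.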
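The deterministic-clock idea behind the flaky-test fix can be illustrated as below. This is a hypothetical sketch: the test itself injects a Clock via autoscaler.setClock(), and the SteppingClock class here is invented for illustration.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

// A Clock that advances by a fixed step on every read, so two consecutive
// metric collections can never observe the same timestamp (the collision
// that caused the metrics history to stay at size 1).
public class SteppingClock extends Clock {
    private Instant current;
    private final Duration step;

    public SteppingClock(Instant start, Duration step) {
        this.current = start;
        this.step = step;
    }

    @Override
    public ZoneId getZone() {
        return ZoneOffset.UTC;
    }

    @Override
    public Clock withZone(ZoneId zone) {
        return this;
    }

    @Override
    public Instant instant() {
        // Return the current instant, then step forward for the next read.
        Instant result = current;
        current = current.plus(step);
        return result;
    }
}
```

Because every call to instant() returns a strictly later timestamp, each metric collection lands in a distinct history entry regardless of how fast the test loop runs.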

Verifying this change

This change can be verified by the existing unit tests once the flaky test is fixed. Additional unit tests will be added to validate:

  • getAverageWithRateFallback returns the per-second average when available and falls back to getRate when it is not.
  • getAverageRate computes the correct average of per-interval deltas and avoids the dilution that endpoint-only rates suffer over a large metrics window.
  • Non-source vertices correctly collect, store, and evaluate NUM_RECORDS_IN_PER_SECOND / NUM_RECORDS_OUT_PER_SECOND.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

